The Design for the Wall Street Journal-based CSR Corpus
Abstract
The DARPA Spoken Language System (SLS) community has long taken a leadership position in designing, implementing, and globally distributing significant speech corpora widely used for advancing speech recognition research. The Wall Street Journal (WSJ) CSR Corpus described here is the newest addition to this valuable set of resources. In contrast to previous corpora, the WSJ corpus will provide DARPA its first general-purpose English, large vocabulary, natural language, high perplexity corpus containing significant quantities of both speech data (400 hrs.) and text data (47M words), thereby providing a means to integrate speech recognition and natural language processing in application domains with high potential practical value. This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus.
INTRODUCTION
As spoken language technology progresses and goals expand, progressively larger and more challenging corpora need to be created to support advanced research. The SLS DARPA 1994 goals are ambitious, focusing on cooperative speakers generating goal-directed, spontaneous continuous speech, in speaker-adaptive and speaker-independent modes, for expandable vocabularies (5000 or more words active), moderate perplexity (100-200), with integrated speech and natural language processing, for speakers in a moderate noise environment, using multiple types of microphones, engaged in command/database and dictation applications. In contrast to typical command/database applications, dictation (i.e. interactive speech-driven word processing) tasks focus on cooperative speakers (e.g. speaker dependent/adaptive sustained usage) who generate continuous speech (usually in a somewhat careful fashion to facilitate accurate transcription), verbalizing their words and sentence punctuation.
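The perplexity figures quoted in this introduction can be read as an effective branching factor: the average number of equally likely words the recognizer must choose among at each position. As a minimal sketch (not the procedure used for the corpus itself), perplexity can be computed from the per-word probabilities that a language model assigns to a test text. The function name `perplexity` and the probability values below are hypothetical, chosen only for illustration.

```python
import math


def perplexity(word_probs):
    """Return exp of the average negative log-probability per word.

    word_probs: probabilities a language model assigned to each word
    of a test text (illustrative input, not data from the WSJ corpus).
    """
    if not word_probs:
        raise ValueError("need at least one word probability")
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)


# Hypothetical case: every word is a uniform choice among 60 candidates.
uniform_60 = [1 / 60] * 5

# Hypothetical case: uneven probabilities for a five-word sentence.
mixed = [0.1, 0.002, 0.05, 0.01, 0.004]

print(round(perplexity(uniform_60), 2))
print(round(perplexity(mixed), 2))
```

A uniform choice among 60 words gives a perplexity of 60, so a task with perplexity in the 100-200 range presents the recognizer with roughly 100 to 200 equally plausible continuations per word on average.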
The existing Resource Management[15] and subsequent Air Travel Information System[16] corpora target specific database inquiry tasks, characterized by medium vocabularies (<1500 words) with language model perplexities ranging from 9 to 60. The WSJ corpus described here is designed to advance CSR technology and support the 1994 SLS research goals. A similar read speech corpus in the French language has been successfully completed using text from the newspaper Le Monde[5].
Commencing with serious contractor concerns regarding suitable CSR corpora[12] in the mid 1980's, the DARPA SLS Coordinating Committee started considering new corpora requirements in early 1990, with the subsequent formation of the CSR Corpus Committee, culminating in the WSJ Corpus design. The CSR Corpus Committee members include J.M. Baker (Dragon, chair), F. Kubala (BBN), D. Pallett (NIST), D. Paul (LL), M. Phillips (MIT), M. Picheny (IBM), R. Rajasekran (TI), B. Weide (CMU), M. Weintraub (SRI), and J. Wilpon (ATT). A survey of the DARPA contractors' CSR research interests disclosed highly diverse, often opposing views. All contractors, however, cited a common interest in pursuing research on "Domain-independent Acoustic Models", "Domain-independent Language Models", and "Speaker-adaptation". Lively meetings and discussions resulted in the definition and preliminary authorization of a major (>400 hrs.) corpus with materials based primarily on WSJ material (backed by WSJ text from 1987-89 provided by the ACL/DCI[9] to enable statistical language modeling) and supplemented by other material (spontaneous dictation, Hansard, etc., shown in Table 1).
*This work was sponsored by the Defense Advanced Research Projects Agency. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government.
This corpus will provide a uniquely rich resource, in a carefully crafted structure designed to elicit a highly productive flow of diagnostic research information with an array of comparative test paradigms. Although this WSJ corpus is large relative to many other available corpora, it should be cautioned that insofar as most research experiments continue to show marked improvement with the increased availability of training data, it is likely that this corpus also will fail to allow us to find or achieve asymptotic performance. Most systems continue to be undertrained or constrained to work in suboptimal lower dimensional spaces, due to their data-starvation. Indeed, this result is not really surprising in light of the much larger amounts of speech data to which young children must be exposed before gaining recognition proficiency of even modest size vocabularies. The structure, features, and dimensions of this corpus constitute the outcome of a heavily debated consensus process, which satisfies the basic (though certainly not all) different requirements of the different research loci of all parties involved. There are significant portions of this corpus which
Similar resources
DARPA February 1992 Pilot Corpus CSR "Dry Run" Benchmark Test Results
Continuous speech recognition research activities within the DARPA Spoken Language community have, within the past several years, been focussed on the Resource Management (RM) and Air Travel Information System (ATIS) corpora. Within the past year, plans have been developed for a large, multi-component "general-purpose English, large vocabulary, natural language, high perplexity corpus" known as...
Spontaneous Speech Collection for the CSR Corpus
As part of a pilot data collection for DARPA's Continuous Speech Recognition (CSR) speech corpus, SRI International experimented with the collection of spontaneous speech material. The bulk of the CSR pilot data was read versions of news articles from the Wall Street Journal (WSJ), and the spontaneous sentences were to be similar material, but spontaneously dictated. In the first pilot portion ...
Benchmark Tests For The Darpa Spoken Language Program
This paper documents benchmark tests implemented within the DARPA Spoken Language Program during the period November, 1992 - January, 1993. Tests were conducted using the Wall Street Journal-based Continuous Speech Recognition (WSJ-CSR) corpus and the Air Travel Information System (ATIS) corpus collected by the Multi-site ATIS Data COllection Working (MADCOW) Group. The WSJ-CSR tests consist of t...
Segmental Neural Net Optimization for Continuous Speech Recognition
Previously, we had developed the concept of a Segmental Neural Net (SNN) for phonetic modeling in continuous speech recognition (CSR). This kind of neural network technology advanced the state-of-the-art of large-vocabulary CSR, which employs Hidden Markov Models (HMM), for the ARPA 1000-word Resource Management corpus. More recently, we started porting the neural net system to a larger, more ...
Speaker-independent continuous speech dictation
In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using newspaper speech corpora. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and...
Publication date: 1992